Skip to content

HDDS-14183. Attempted to decrement available space to a negative value#9655

Merged
sumitagrawl merged 8 commits intoapache:masterfrom
sarvekshayr:HDDS-14183
Mar 26, 2026
Merged

HDDS-14183. Attempted to decrement available space to a negative value#9655
sumitagrawl merged 8 commits intoapache:masterfrom
sarvekshayr:HDDS-14183

Conversation

@sarvekshayr
Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Saw this warning when datanode disk was nearly full:

2025-12-15 10:37:20,903 WARN [166c5ca8-343e-46ed-b619-84ac193e0069-ChunkReader-215]-org.apache.hadoop.hdds.fs.CachingSpaceUsageSource: Attempted to decrement available space to a negative value. Current: 0, Decrement: 1048576, Source: /ipdr_ozone31/hadoop-ozone/datanode/data

Prior to this message, there were many failed writes. Perhaps it needs to increment the value when the write fails.

The fix adds rollback logic in KeyValueHandler.handleWriteChunk() that tracks when a chunk write succeeds and increments the usedSpace counter. If any subsequent operation fails, the exception handler calls volume.decrementUsedSpace() to restore the counter.

What is the link to the Apache JIRA

HDDS-14183

How was this patch tested?

CI: https://github.com/sarvekshayr/ozone/actions/runs/21200210393

@spacemonkd spacemonkd added the bug Something isn't working label Jan 28, 2026
@github-actions
Copy link
Copy Markdown

This PR has been marked as stale due to 21 days of inactivity. Please comment or remove the stale label to keep it open. Otherwise, it will be automatically closed in 7 days.

@github-actions github-actions Bot added the stale label Feb 19, 2026
Copy link
Copy Markdown
Contributor

@siddhantsangwan siddhantsangwan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that the root cause is not bad exception handling, but actually not implementing idempotency correctly in the overwrite case.

In FilePerBlockStrategy#writeChunk, containerData.updateWriteStats(chunkLength, overwrite) is called at the very end. At that point any exception is already thrown. So it's correct to update write stats at that point.

And updateWriteStats always calls incrWriteBytes(bytesWritten), which always calls volume incrementUsedSpace(bytes):

  public void updateWriteStats(long bytesWritten, boolean overwrite) {
    getStatistics().updateWrite(bytesWritten, overwrite);
    incrWriteBytes(bytesWritten); <--------------- this, always called
  }

  private void incrWriteBytes(long bytes) {
    this.getVolume().incrementUsedSpace(bytes); <-------------- this, always called
    long bytesUsedBeforeWrite = getBytesUsed() - bytes;
    long availableSpaceBeforeWrite = getMaxSize() - bytesUsedBeforeWrite;
    if (committedSpace && availableSpaceBeforeWrite > 0) {
      long decrement = Math.min(bytes, availableSpaceBeforeWrite);
      this.getVolume().incCommittedBytes(-decrement);
    }
  }

So used space is increased and available space is decreased even on overwrite. For overwrites that don't actually grow the file on disk, this is wrong and it's probably what leads to available space being decreased more than what's possible.

For example:

chunkLength = 1,048,576 (1MB),
block file already length 1,048,576,
retry writes same chunk at offset = 0 (overwrite = true),
so after write, file length still 1,048,576 (no growth). But used space is increased and available space is decreased incorrectly.

So this is what needs to be fixed. I also suggest trying to keep it simple as we are already using the term reserved space in other parts of the code, and that means something else. This area is pretty complex at this point!

@github-actions
Copy link
Copy Markdown

This PR has been marked as stale due to 21 days of inactivity. Please comment or remove the stale label to keep it open. Otherwise, it will be automatically closed in 7 days.

@github-actions github-actions Bot added the stale label Mar 24, 2026
Copy link
Copy Markdown
Contributor

@siddhantsangwan siddhantsangwan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. @sarvekshayr If you can add some code comments to explain how this part works, I think that'd help other reviewers reason about it as well.

Copy link
Copy Markdown
Contributor

@sumitagrawl sumitagrawl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sumitagrawl sumitagrawl merged commit f3cd59a into apache:master Mar 26, 2026
45 checks passed
@sarvekshayr
Copy link
Copy Markdown
Contributor Author

Thanks @sumitagrawl, @siddhantsangwan, @yandrey321 and @sreejasahithi for the review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working stale

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants